An examination of psychometric properties of study quality assessment scales in meta-analysis: Rasch measurement model applied to the firefighter cancer literature

Most existing quality scales have been developed with minimal attention to accepted standards of psychometric properties. Even for those that have been used widely in medical research, limited evidence exists supporting their psychometric properties. The focus of our current study is to address this gap by evaluating the psychometric properties of two existing quality scales that are frequently used in cancer observational research: (1) Item Bank on Risk of Bias and Precision of Observational Studies developed by the Research Triangle Institute (RTI) International and (2) Newcastle-Ottawa Quality Assessment Scale (NOQAS). We used the Rasch measurement model to evaluate the psychometric properties of the two quality scales based on the ratings of 49 studies that examine firefighters’ cancer incidence and mortality. Our study found that RTI and NOQAS have acceptable item reliability. Two raters were consistent in their assessment, demonstrating high interrater reliability. We also found that NOQAS has more items that show better fit than the RTI scale. The NOQAS produced lower study quality scores with a smaller variation, suggesting that NOQAS items are much easier to rate. Our findings accord with a previous study, which concluded that the RTI scale was harder to apply and thus produces more heterogeneous quality scores than NOQAS. Although both RTI and NOQAS showed high item reliability, NOQAS items fit the underlying construct better, showing higher validity of internal structure and stronger psychometric properties. The current study adds to our understanding of the psychometric properties of NOQAS and RTI scales for future meta-analyses of observational studies, particularly in the firefighter cancer literature.


Introduction
Assessment of study quality is a critical aspect of conducting meta-analyses. Study quality varies considerably across studies and may lead to heterogeneity in study findings [1][2][3][4][5] and Bravo warned that overall effect size estimates obtained from meta-analyses that do not account for variation in study quality may suffer from increased Type I error rates [6]. In addition, other factors investigators ought to be concerned with when evaluating studies include the sources, directions, and even plausible magnitudes of such biases [7,8]. Therefore, many researchers suggest that the quality of primary studies should be accurately assessed and used in meta-analysis [5,7]. Despite the importance of assessing study quality in general, many researchers have identified challenges in dealing with the quality of primary studies in meta-analyses [3,6,9]. One of the critical issues is that while there are a variety of scales to assess the quality of primary studies, none has been universally adopted [10]. In fact, there is no consensus about how study quality should be conceptualized or measured in the existing quality scales [5,11,12,13]. Moreover, most existing quality scales have been developed with minimal attention to accepted standards of psychometric properties such as reliability and validity [14]. Most of the research has focused on interrater reliability measures, such as kappa statistics or percentage of agreement, rather than item reliability, content validity, or construct validity. In addition, even for those scales that have been used widely in medical research, little to no evidence exists supporting their psychometric properties.
Therefore, the focus of our current study is to address this gap by evaluating the psychometric properties (i.e., item reliability, interrater reliability, and construct reliability) of existing quality scales that are frequently used in cancer observational research: (1) Item Bank on Risk of Bias and Precision of Observational Studies developed by the Research Triangle Institute (RTI) International [15] and (2) Newcastle-Ottawa Quality Assessment Scale (NOQAS) [16]. Specifically, we used the Rasch measurement model [17] to evaluate the psychometric properties of these two quality scales based on the ratings of 49 studies that examine firefighters' cancer incidence and mortality. The present study is focused on three primary research questions:
1. Can the RTI or the NOQAS scale be considered reliable?
2. Do the items of RTI or NOQAS fit the overall quality score?
3. Do the individual studies fit the overall quality score?

Study quality
Two different frameworks have been proposed in the literature to define and measure the quality of primary studies in meta-analysis [18]. One is based on the validity framework developed by Campbell and his associates [19] and the other, called "quality assessment", was proposed by Chalmers and his colleagues [8].
The former approach, based on the idea of Campbell and his associates, suggests a matrix of designs and their features or threats to validity. The validity framework includes 33 separate threats to validity based on four distinct categories: internal, external, statistical, and construct validity [18]. This validity framework for assessing the quality of primary studies in a meta-analysis is mainly used in the social sciences. For instance, Devine and Cook [20] evaluated the quality of primary studies based on the validity framework by examining six design features representing internal, external, statistical, and construct validity (e.g., floor effect, publication bias, attrition, and domains of content).
The second approach, proposed by Chalmers and his associates, has been applied primarily to medical research [8,18,21,22,23,24]. The objective of Chalmers' system is to quantify the overall quality of primary studies based on in-depth criteria for assessing randomized controlled trials. Chalmers and his colleagues mainly focused on construct validity and statistical conclusion validity, examining such features as randomization, blinding of the statistician, and minimization of data-extraction bias [18].

Study quality assessments in observational studies
An informal PubMed search of published meta-analyses and systematic reviews in the cancer literature revealed that the Newcastle-Ottawa Quality Assessment Scale (NOQAS) [16] was the most widely employed tool for review articles focused on risk factor association studies [25][26][27][28][29][30][31]. This tool was employed in a recent meta-analysis of the firefighter cancer literature [32]. The second identified assessment tool was less commonly employed in cancer-focused meta-analyses and systematic reviews [33,34]: the Research Triangle Institute (RTI) International Item Bank on Risk of Bias and Precision of Observational Studies [35,36]. Although not commonly employed in cancer meta-analyses [33], it has been utilized in a variety of syntheses of other disease outcome association studies [36][37][38][39][40][41] and was employed in a systematic review of lung function in firefighters [42]. Of note, some investigators have employed both the NOQAS and the RTI item bank to assess quality in meta-analyses and systematic reviews [37,42,43]. The RTI item bank comprises 29 multiple-choice questions designed to assess a range of risk of bias and precision domains for a variety of observational study designs [36,37]. These domains include: sample definition and selection, interventions/exposure, outcomes, creation of treatment groups, blinding, soundness of information, follow-up, analysis comparability, analysis outcome, interpretation, and presentation and reporting. Investigators are encouraged to select items from the bank that are most appropriate to the content area and study design of the studies under assessment.
The 8-item NOQAS was developed to assess the quality of nonrandomized studies with specific assessment forms for case-control and cohort study designs [16]. Several questions are designed to be tailored for use given the content being assessed. A simple summary quality score can be obtained by summing each individual item judged to be of high quality, although given its relatively short length investigators often report quality levels for the individual 8 items for each study under review. The NOQAS has been recommended for use for the assessment of quality of observational study designs [41,44].
Our literature search using PubMed, PsycInfo and Medline identified one published study that compares the psychometric evidence of NOQAS to RTI. In a study by Margulis and her colleagues [40], two raters independently assessed the quality of 44 primary studies with RTI and NOQAS. After coding the quality of studies, Margulis and her colleagues computed interrater agreement using percentage of agreement and the first-order agreement coefficient statistic. In their study, the agreement between NOQAS and RTI in rank-ordering studies by risk of bias was found to be moderate, as indicated by Spearman's rank correlation coefficients of .35 and .38. Also, the authors stated that NOQAS is easier to apply than the RTI item bank, but more limited in its scope, although the scope of quality was similar between NOQAS and RTI. Lastly, the interrater reliabilities between raters were reported to be fair for both NOQAS and RTI.
Like the study by Margulis and her colleagues [40], a few published studies addressed the quality of either NOQAS or RTI, using interrater reliability measures such as kappa statistics or percentage of agreement between raters [41,44]. All of these studies evaluated the psychometric properties of the quality assessment tools under the Classical Test Theory (CTT) framework, which is relatively simple in that it analyzes the raw data of the instrument. Also, most of the existing studies focused on interrater reliability or face validity of items used to measure the quality of individual studies.

Rasch measurement model
Whereas classical test theory (CTT) has been frequently used in evaluating the validity and reliability of study quality ratings, some issues have arisen regarding the calibration of item difficulty, sample dependence of coefficient measures, and estimates of measurement error. The Rasch model enables us to address those issues by (1) assessing the dimensionality of assessment; (2) identifying redundant items or items that measure a different construct or construct-irrelevant factors through the item-fit; (3) identifying items that should be flagged based on their difficulty levels; and (4) assessing whether response categories are appropriate for distinguishing items by their quality.
The Rasch Measurement Theory (RMT) is a psychometric model to analyze categorical data (particularly dichotomous variables) as a function of a person's (e.g., a rater's or reviewer's) ability on a trait and the item difficulty [17]. Andrich [45] then developed the Rasch Rating Scale Model (RSM, also called the Polytomous Rasch model) for polytomous data, which is data with more than two ordinal categories. The RSM provides estimates of (a) person location on a continuous latent variable, (b) item difficulties, and (c) an overall set of thresholds that are fixed across items.
RMT obtains information from the person and the item to estimate the probability of a person with a given level of ability answering a given item correctly, thus connecting person ability to item difficulty [46]. This probabilistic framework allows RMT to be falsifiable and to meet the linearity assumptions of parametric statistical tests. Therefore, measures of fit statistics for both person-fit and item-fit can be obtained, which provide evidence of validity, that is, how well the model can predict the response to each item.
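To make the three sets of RSM parameters concrete, the following is a minimal sketch of how category probabilities could be computed from a person location, an item difficulty, and a shared set of thresholds. The function name `rsm_category_probs`, its argument names, and the numeric values are illustrative assumptions, not part of the original analysis, which was run in FACETS.

```python
import math

def rsm_category_probs(theta, delta, taus):
    """Category probabilities under the Rasch Rating Scale Model (sketch).

    theta: person location in logits
    delta: item difficulty in logits
    taus:  thresholds tau_1..tau_m shared by all items
    Returns probabilities for categories 0..m.
    """
    # Category 0 has exponent 0; category k adds (theta - delta - tau_k).
    exponents = [0.0]
    for tau in taus:
        exponents.append(exponents[-1] + (theta - delta - tau))
    # Normalize; subtracting the maximum keeps exp() numerically stable.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values: three ordered categories (two thresholds).
probs = rsm_category_probs(theta=0.5, delta=-0.2, taus=[-1.0, 1.0])
```

With a single threshold fixed at zero, the same function reduces to the dichotomous Rasch model, where the probability of the higher category is exp(theta - delta) / (1 + exp(theta - delta)).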
In addition, RMT transforms ordinal data to logits, which allows a proper use of parametric statistical analysis, without assumption violation that is associated with Type I and II error inflation. Lastly, the item parameters estimated by RMT are generally invariant to the population used to generate these estimates. In other words, parameter estimates obtained from a sufficient sample should be equivalent to those obtained from another sufficient sample despite the average person's ability level in each of the samples [46]. This property of RMT allows for greater generalization of results as well as more sophisticated applications.

Psychometric properties in Rasch measurement model
For any quality test or assessment, the supporting evidence must have three psychometric properties: validity, reliability, and fairness [47]. This section briefly reviews how each of these psychometric properties can be assessed when using the RMT. In this study, our focus is on reliability and validity.
Reliability. Reliability refers to the consistency or precision of scores across replications of a testing procedure. Under RMT, the Rasch-based reliability index, called the reliability of separation, is used to measure the reliability of a test or assessment. A reliability of separation index is obtained based on latent measures with equal intervals along the underlying continuum and it reflects how distinct latent scores are along the scale, which ranges from 0 to 1. This is defined as:
$$\text{Rel} = \frac{SD^2 - MSE}{SD^2},$$
where SD = standard deviation of Rasch measures of a specific facet (e.g., students, tasks, and raters) and MSE = average Mean Squared Errors of Rasch measures for each facet. Higher values indicate higher reliability. High reliabilities are preferred because they indicate a good representation of Rasch measures across the entire range of the latent scale.
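As an illustration of the reliability of separation, the sketch below assumes the standard form (SD² − MSE) / SD², with SD² taken as the population variance of the facet's Rasch measures and MSE as the mean of their squared standard errors. The function name and the numbers are hypothetical and are not values from this study.

```python
import statistics

def separation_reliability(measures, standard_errors):
    """Reliability of separation for one facet (illustrative sketch)."""
    # SD^2: observed variance of the Rasch measures (population form assumed).
    observed_var = statistics.pvariance(measures)
    # MSE: average of the squared standard errors of those measures.
    mse = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    # Share of the observed variance that is not measurement error.
    return (observed_var - mse) / observed_var

# Hypothetical item measures (logits) and their standard errors.
item_measures = [-2.0, -1.0, 0.0, 1.0, 2.0]
item_errors = [0.3, 0.3, 0.3, 0.3, 0.3]
reliability = separation_reliability(item_measures, item_errors)
```

Widely spread measures with small standard errors push the index toward 1, which is the pattern behind the high item reliabilities discussed in this paper.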
Validity. Validity refers to the degree to which theory supports the interpretation of test scores [47]. Under the RMT, the Infit and Outfit Mean Square (MnSq) statistics can be used to evaluate how well the measures of an individual facet (i.e., item, study, and rater) fit the constructed latent scale (i.e., study quality score). In particular, the Infit MnSq identifies irregular response patterns, and the Outfit MnSq detects large residual values. The expected value for both Infit and Outfit MnSq statistics is 1.0, which shows a perfect fit to the underlying scale. The fit indices provide diagnostic information for identifying misfit elements on each facet (e.g., item, study, or rater), supporting the validity arguments of internal structure. Therefore, the validity can be rated on a scale ranging from A (item, study, or rater fits the scale very well) to D (item, study, or rater does not fit the scale). See Table 1 for the guidelines for interpreting the Infit and Outfit MnSq values.
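The two fit statistics can be sketched for the simplest dichotomous case, which is an assumption made here for brevity since the ratings in this study are polytomous. Outfit is the unweighted mean of squared standardized residuals, while Infit weights the squared residuals by the model variance. The function name and inputs are hypothetical.

```python
def fit_mean_squares(observed, expected_probs):
    """Infit and Outfit MnSq for dichotomous responses (illustrative sketch).

    observed:       list of 0/1 responses for one item (or one study)
    expected_probs: model probabilities of a 1, each strictly between 0 and 1
    """
    # Squared residuals between observed responses and model expectations.
    residual_sq = [(x - p) ** 2 for x, p in zip(observed, expected_probs)]
    # Model variance of each dichotomous response.
    variances = [p * (1 - p) for p in expected_probs]
    # Outfit: mean of squared standardized residuals (sensitive to outliers).
    outfit = sum(r / v for r, v in zip(residual_sq, variances)) / len(observed)
    # Infit: information-weighted mean square.
    infit = sum(residual_sq) / sum(variances)
    return infit, outfit

# Hypothetical data: the third response is highly unexpected under the model.
infit, outfit = fit_mean_squares([1, 0, 1], [0.9, 0.1, 0.1])
```

Values near 1.0 indicate responses about as variable as the model predicts; the unexpected third response in this toy example pushes both statistics well above 1.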

Description of 49 studies on firefighter cancer incidence and mortality
The studies evaluated in this quality assessment were gathered for a meta-analysis project that examines cancer incidence and mortality risk among firefighters. The included studies were identified through a comprehensive literature search using multiple databases including ERIC, PsycINFO, ProQuest Dissertation & Theses, PUBMED, and MEDLINE via EBSCO, and online search engines including Embase, Web of Science Core collection, Google Scholar, and SCOPUS. A total of 49 studies were identified that met the inclusion and exclusion criteria.
Two independent raters were responsible for coding (1) study design characteristics, (2) outcome type, (3) cancer coding system, (4) cancer types, (5) source of occupation designations, (6) type of incident that firefighters attended, (7) sample characteristics, and (8) study characteristics. Two additional reviewers were responsible for coding the statistical estimates presented in these studies for computing (1) a standardized incidence ratio and (2) a standardized mortality ratio.

Procedure
Two content experts in epidemiology independently rated the 49 observational studies using the RTI item bank and NOQAS. The two independent raters were: (1) a cancer epidemiologist who holds a PhD in Epidemiology and has 20 years of experience in cancer research and teaching; and (2) a chronic disease and occupational epidemiologist who holds a PhD in preventive medicine and community health, has over 30 years of teaching and research experience, and is the Principal Investigator of the Florida cancer registry (Florida Cancer Data System).
The two study quality scales were first tested by the independent reviewers on a sample of studies (i.e., random sample of 5-7 studies) to ensure that consistent assumptions and criteria were employed by raters. Slight modifications were then made to the original quality assessments to better align with the methods of the studies evaluated, and some items were removed that were not relevant. The items evaluated along with their modifications (modifications are italicized) and specific instructions (13 RTI and 8 NOQAS items) are displayed in Table 2.

PLOS ONE
Study quality assessment scales: Rasch measurement model applied to the firefighter cancer literature

Representativeness of the exposed cohort-Newcastle1
Item is assessing the representativeness of exposed individuals in the community, not the representativeness of the sample of women from some general population.
a) truly representative of the average _______________ (describe) in the community = 1 star
b) somewhat representative of the average ______________ in the community = 1 star
c) selected group of users, e.g., nurses, volunteers = 0 stars
d) no description of the derivation of the cohort = 0 stars

2. Selection of the non-exposed cohort-Newcastle2
a) drawn from the same community as the exposed cohort = 1 star
b) drawn from a different source = 0 stars
c) no description of the derivation of the non-exposed cohort = 0 stars
Note: In the case of general population can code = 1 star; if other occupational groups only then code 0.5 star given possible overlapping exposures.

3. Ascertainment of exposure-Newcastle3
a) secure record (e.g., surgical records) = 1 star
b) structured interview = 1 star
c) written self report = 0 stars
d) no description = 0 stars
Note: If self-report = 0 stars. Exposure based on registry/death records = 0.5 star. Disregard other confounders when assessing on this item.

Demonstration That Outcome of Interest Was Not Present at Start of Study-Newcastle4
In the case of mortality studies, outcome of interest is still the presence of a disease/incident, rather than death. That is to say that a statement of no history of disease or incident earns a star.
a) yes = 1 star
b) no = 0 stars
Note: For mortality studies code = 1 star if you can assume that persons were alive at enrollment into the cohort.

Comparability of Cohorts on the Basis of the Design or Analysis-Newcastle5
Age and one other control OR stratified analysis will qualify for two star rating.
a) study controls for ______ (select the most important factor) = 1 star
b) study controls for any additional factor = additional 1 star

OUTCOME
1. Assessment of outcome-Newcastle6
a) independent blind assessment (independent or blind assessment stated in the paper, or confirmation of the outcome by reference to secure records) = 1 star
b) record linkage (e.g., identified through ICD codes on database records) = 1 star
c) self-report (i.e., no reference to original medical records to confirm the outcome) = 0 stars
d) no description = 0 stars
Note: ICD version should be specified in order to earn 1 star.

Was follow-up long enough for outcomes to occur-Newcastle7
An acceptable length of time should be decided before quality assessment begins.
a) yes (select an adequate follow up period for outcome of interest) = 1 star
b) no = 0 stars
Note: Must mention >= 2 year lag; if not then assign = 0 stars.

Adequacy of follow up of cohorts-Newcastle8
This item assesses the follow-up of the exposed and non-exposed cohorts to ensure that losses are not related to either the exposure or the outcome.
a) complete follow up, all subjects accounted for = 1 star
b) subjects lost to follow up unlikely to introduce bias = 1 star
c) low follow up rate or no description of those lost = 0 stars
d) no statement = 0 stars
Note: No star assignment if loss to follow-up exceeds 10%; active follow-up (e.g., last date of contact reported when the event could be verified) = 1 star; passive follow-up or no mention of follow-up = 0 stars.

RISK OF BIAS AND PRECISION OF OBSERVATIONAL STUDIES (RTI)

Is the length of time following the intervention/exposure sufficient to support the evaluation of primary outcomes and harms?-RTI11
[PI: Primary outcomes (including harms) should be identified for abstractors. Important measures may be listed separately. Abstractors should be provided with specific criteria for sufficient length of follow-up based on prior research or theory. Drop if entire body of evidence is cross-sectional or if minimal length of follow-up period is specified through inclusion criteria. Note: If cross-sectional, list as not applicable. Otherwise, in most cases this will be coded "Yes" if the follow-up period only includes events taking place at least two years after joining the cohort. If the follow-up period includes events taking place less than two years after joining the cohort, then code "No". If there is no information provided on the follow-up period, then report "Cannot determine".]
Response options: Yes / Partially: some primary outcomes are followed for a sufficient length of time / No / Cannot determine / Not applicable: cross-sectional

Model specification
The FACETS computer program [48,49] for Rasch analysis was used to examine the quality of the two study quality assessments using a Many-facet Rating Scale Model (MFRM) [48]. The MFRM is expressed as below.
$$\ln\left(\frac{P_{jnmi,k}}{P_{jnmi,(k-1)}}\right) = \theta_j - \delta_i - \tau_k, \tag{3}$$
where $P_{jnmi,k}$ = probability of study j receiving a rating k on item i; $P_{jnmi,(k-1)}$ = probability of study j receiving a rating k−1 on item i; $\theta_j$ = quality measure of study j; $\delta_i$ = difficulty of endorsing item i; $\tau_k$ = difficulty of endorsing category k relative to k−1.
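As a brief worked step, assume the adjacent-category form ln(P_{jnmi,k} / P_{jnmi,(k−1)}) = θ_j − δ_i − τ_k, using only the study, item, and threshold terms defined above (any further facet term, such as rater severity, would enter the sums in the same way). Chaining the log-odds from category 0 up to category k, with categories k = 0, ..., m and the empty sum for k = 0 taken as zero, gives the probability of each rating category:

```latex
\ln\frac{P_{jnmi,k}}{P_{jnmi,0}} = \sum_{h=1}^{k} \left(\theta_j - \delta_i - \tau_h\right)
\quad\Longrightarrow\quad
P_{jnmi,k} = \frac{\exp\left[\sum_{h=1}^{k} \left(\theta_j - \delta_i - \tau_h\right)\right]}
                  {\sum_{x=0}^{m} \exp\left[\sum_{h=1}^{x} \left(\theta_j - \delta_i - \tau_h\right)\right]}
```

The denominator simply normalizes the category terms so that the probabilities sum to one for each study and item.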

Analyses
The ratio between $P_{jnmi,k}$ and $P_{jnmi,(k-1)}$ specified in Eq 3 is called the odds, so the log-odds (logits) are a linear combination of latent measures for different facets. Since all the measures are on a common scale with logits as the units, the MFRM can create measures on an additive interval scale. Higher logit values reflect higher quality for studies, and items that are more difficult to endorse. These values were presented using a Wright map to show an empirical display of study quality scores and item difficulties. In addition to logit values, we computed the reliability of separation indices for items, studies, and raters, and the Infit and Outfit MnSq statistics. The reliability of separation indices shows how reproducible the scale would be if using a different but equivalent study sample. Infit and Outfit MnSq statistics are used to demonstrate how well items and studies fit the latent scale. Lastly, a Chi-square test is performed to examine whether all studies can be viewed as equal in quality. A significant result indicates that the studies are distinct from each other.
These results indicate that, on average, the RTI item bank produced much higher quality scores across the 49 studies than the NOQAS. Figs 1 and 2 display the Wright map, which is an empirical display of the RTI scale (Fig 1) and NOQAS (Fig 2), respectively. In each figure, the first column shows the Rasch score on a logit scale. The items (column 4), raters (column 3), and the individual studies (column 2) are located on the Wright map based on their Rasch score. The last column displays threshold estimates of response categories on the Likert scale. As shown in Fig 1, the latent Rasch scores of study quality measured by the RTI scale were skewed to the left, indicating that most studies appeared to be low in study quality, while the Rasch scores measured by NOQAS followed a normal distribution (see Fig 2). Out of 13 items in the RTI scale item bank, 9 were above the mean of 0 (column 3 in Fig 1), indicating that most were quite difficult to evaluate. On the other hand, items on NOQAS were distributed relatively evenly in terms of item difficulty (column 3 in Fig 2), except item #5 (very easy; located at 5 standard deviations below the mean). Lastly, the two raters (column 2) were quite consistent in their ratings of study quality using both the NOQAS and the RTI item bank.

Psychometric evidence for the RTI items for firefighters' cancer literature
Dimensionality. Results from the Many-facet Rating Scale Model (MFRM) indicated that there was one underlying factor that explained 79.82% of the variance in the 13 items. This result suggests that the RTI is unidimensional for measuring the quality of individual studies (> 20%) [49].
Reliability. The reliability of separation for RTI scale items was .99 (near 1.0), implying that the distribution of item measures represents the entire range of the latent scale well. A reliability value higher than .80 suggests that the RTI item bank shows acceptable reproducibility and consistency in the ordering of the Rasch scale scores.
Validity. As shown in Table 4, the Infit and Outfit MnSq for RTI scale items 6, 9, 10, and 12 were found to fall into a fit category of A, indicating a good fit of each item to the study quality scale. Although items 2 and 11 fell into a fit category of A using the Infit MnSq, item 11 did show high MnSq. Items 1, 3, 5, 7, 8 and 11 were less productive based on either Outfit or Infit MnSq. See Table 4.
The quality measures of the 49 studies were significantly different, χ²(46) = 177.7, p < .01. As shown in Table 5, study 29 had the highest study quality score, while study 8 had the lowest. Most studies fit the scale well with a fit category of A or B, with six studies falling into category C or D.

Psychometric evidence for the NOQAS items for firefighters' cancer literature
Dimensionality. The result from the MFRM indicated that one underlying factor exists. This factor explained 41.41% of the variance in the 8 items, suggesting that the NOQAS is unidimensional for measuring the quality of individual studies (> 20%) [49].

Reliability. The reliability of separation for items was .99, implying that the NOQAS quality assessment scale does show acceptable reproducibility and consistency of the ordering of the Rasch scale scores.
Validity. As shown in Table 6, the Infit and Outfit MnSq for most items fell into a fit category of A, indicating a good fit of each item to the study quality scale. Exceptions were item 1 (B for Outfit MnSq) and item 7 (C for Infit and Outfit MnSq). In particular, item 7 had high Infit and Outfit MnSq values, showing that this item is unproductive and distorting. Content specialists should be consulted regarding future uses of item 7 for this scale.
Results indicated that the study quality measures of these 49 studies were significantly different, χ²(46) = 71.9, p = .05. As shown in Table 7, study 12 had the highest study quality score, while study 40 had the lowest. Most studies fit the scale well with a fit category of A or B, with eight studies falling into category C or D.

Comparison between RTI and NOQAS for firefighters' cancer literature
The reliability of item separation index for the RTI scale and NOQAS were both found to be high (approximating 1), indicating that both scales are reproducible and consistent in the ordering of Rasch scale scores. In terms of rater agreement, the two coders rated study quality equally consistently using NOQAS and RTI, as shown in Tables 8 and 9. The NOQAS measures produced much lower Rasch latent study quality scores with less variation, and their items were much easier to rate than those of the RTI. The NOQAS had more items that showed better fit between items and the overall quality scores. The reason for this could be that the NOQAS was adapted to assess the quality of firefighters' cancer literature, which may have enabled ratings to be more closely aligned and less varied. As a result, there may be better fit between items and the quality scores. Additionally, study quality scores measured by the NOQAS were found to follow a normal distribution. Both measures were found to be unidimensional.

Discussion
Using firefighters' cancer literature, the current study is the first attempt to examine the psychometric properties of two commonly used study quality assessment measures using the Rasch measurement theory. Among their many strengths, Rasch models can be used to (a) produce invariant study quality measures on a latent continuum, (b) assess the validity, reliability, and fairness of latent measures, and (c) use latent scores to explain variation in outcome measures. These characteristics of Rasch measurement theory offer practical applications in meta-analysis. For example, study quality scores estimated by the Rasch measurement model can be compared directly across different studies and further modeled to explain variation in study effects. Our study found that the RTI scale and NOQAS were reproducible and consistent in evaluating the quality of firefighters' cancer literature, showing high item reliability. In terms of interrater reliability, the two raters were quite consistent in their assessment of study quality when using both the RTI and NOQAS scales. In terms of validity, we found that the NOQAS has more items that show better fit to the underlying construct of study quality than the RTI scale. This result indicates that NOQAS demonstrates better validity of internal structure for measuring the quality of firefighters' cancer literature. Lastly, latent scores measured using NOQAS were distributed across the full range of the latent scale, with much lower study quality scores and smaller variation. These results suggest that NOQAS items are much easier to use for rating the quality of firefighters' cancer literature. Our findings accord with a previous study conducted by Margulis and her colleagues [40], which concludes that RTI was harder to apply and thus produces more heterogeneous quality scores than NOQAS.
The present study is significant in at least two major respects. First, the current study is the first of its kind to assess the psychometric properties (reliability and validity) of two quality assessment tools that are widely used in observational studies. Previous studies focused on interrater reliability of the NOQAS and RTI scales, thus leaving the item reliability and validity of NOQAS and RTI unanswered. The current study provides the psychometric properties (reliability and validity) of NOQAS and RTI for future use beyond interrater reliability. Second, and more importantly, we used the Rasch Measurement Theory (RMT), which produces comparable quality scores for the studies included in a meta-analysis and thereby further enhances their generalizability and applicability in meta-analysis. This is because Rasch scores allow us to utilize parametric statistical analyses, most of which assume a normal distribution. When utilizing the Rasch scores of NOQAS and RTI in a meta-analysis of firefighters' cancer incidence and mortality, we found that NOQAS scores significantly predict variation in the effect sizes. Specifically, results from a mixed-effects model indicate a significant and positive relationship between quality score and firefighters' cancer incidence and mortality. Lastly, the item parameters estimated by RMT are generally invariant to the population, which will offer greater generalization of meta-analytic results.
In this study, we did not address one of the important psychometric properties: whether NOQAS and RTI showed fairness in their assessment. If NOQAS and RTI are equally applicable to any study, it is expected that NOQAS and RTI scores are invariant regardless of study characteristics such as sampling method, funding sources, inclusiveness of samples, and whether a study used a good-quality instrument or not. Despite this limitation, the current study adds to our understanding of the psychometric properties of NOQAS and RTI for future meta-analyses of observational studies similar to the firefighters' cancer literature.